{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Text classification with the torchtext library\n\nIn this tutorial, we will show how to use the torchtext library to build the dataset for the text classification analysis. Users will have the flexibility to\n\n   - Access to the raw data as an iterator\n   - Build data processing pipeline to convert the raw text strings into ``torch.Tensor`` that can be used to train the model\n   - Shuffle and iterate the data with [torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader)_\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Access to the raw dataset iterators\n\nThe torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the ``AG_NEWS`` dataset iterators yield the raw data as a tuple of label and text.\n\nTo access torchtext datasets, please install torchdata following instructions at https://github.com/pytorch/data. \n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nfrom torchtext.datasets import AG_NEWS\ntrain_iter = iter(AG_NEWS(split='train'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "::\n\n    next(train_iter)\n    >>> (3, \"Fears for T N pension after talks Unions representing workers at Turner\n    Newall say they are 'disappointed' after talks with stricken parent firm Federal\n    Mogul.\")\n\n    next(train_iter)\n    >>> (4, \"The Race is On: Second Private Team Sets Launch Date for Human\n    Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\\\team of\n    rocketeers competing for the  #36;10 million Ansari X Prize, a contest\n    for\\\\privately funded suborbital space flight, has officially announced\n    the first\\\\launch date for its manned rocket.\")\n\n    next(train_iter)\n    >>> (4, 'Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded\n    by a chemistry researcher at the University of Louisville won a grant to develop\n    a method of producing better peptides, which are short chains of amino acids, the\n    building blocks of proteins.')\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prepare data processing pipelines\n\nWe have revisited the very basic components of the torchtext library, including vocab, word vectors, tokenizer. Those are the basic data processing building blocks for raw text string.\n\nHere is an example for typical NLP data processing with tokenizer and vocabulary. The first step is to build a vocabulary with the raw training dataset. Here we use built in\nfactory function `build_vocab_from_iterator` which accepts iterator that yield list or iterator of tokens. Users can also pass any special symbols to be added to the\nvocabulary.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torchtext.data.utils import get_tokenizer\nfrom torchtext.vocab import build_vocab_from_iterator\n\ntokenizer = get_tokenizer('basic_english')\ntrain_iter = AG_NEWS(split='train')\n\ndef yield_tokens(data_iter):\n    for _, text in data_iter:\n        yield tokenizer(text)\n\nvocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=[\"<unk>\"])\nvocab.set_default_index(vocab[\"<unk>\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The vocabulary block converts a list of tokens into integers.\n\n::\n\n    vocab(['here', 'is', 'an', 'example'])\n    >>> [475, 21, 30, 5297]\n\nPrepare the text processing pipeline with the tokenizer and vocabulary. The text and label pipelines will be used to process the raw data strings from the dataset iterators.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "text_pipeline = lambda x: vocab(tokenizer(x))\nlabel_pipeline = lambda x: int(x) - 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. The label pipeline converts the label into integers. For example,\n\n::\n\n    text_pipeline('here is the an example')\n    >>> [475, 21, 2, 30, 5297]\n    label_pipeline('10')\n    >>> 9\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Generate data batch and iterator\n\n[torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader)_\nis recommended for PyTorch users (a tutorial is [here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)_).\nIt works with a map-style dataset that implements the ``getitem()`` and ``len()`` protocols, and represents a map from indices/keys to data samples. It also works with an iterable dataset with the shuffle argument of ``False``.\n\nBefore sending to the model, ``collate_fn`` function works on a batch of samples generated from ``DataLoader``. The input to ``collate_fn`` is a batch of data with the batch size in ``DataLoader``, and ``collate_fn`` processes them according to the data processing pipelines declared previously. Pay attention here and make sure that ``collate_fn`` is declared as a top level def. This ensures that the function is available in each worker.\n\nIn this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torch.utils.data import DataLoader\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\ndef collate_batch(batch):\n    label_list, text_list, offsets = [], [], [0]\n    for (_label, _text) in batch:\n         label_list.append(label_pipeline(_label))\n         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)\n         text_list.append(processed_text)\n         offsets.append(processed_text.size(0))\n    label_list = torch.tensor(label_list, dtype=torch.int64)\n    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)\n    text_list = torch.cat(text_list)\n    return label_list.to(device), text_list.to(device), offsets.to(device)\n\ntrain_iter = AG_NEWS(split='train')\ndataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Define the model\n\nThe model is composed of the [nn.EmbeddingBag](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag)_ layer plus a linear layer for the classification purpose. ``nn.EmbeddingBag`` with the default mode of \"mean\" computes the mean value of a \u201cbag\u201d of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since the text lengths are saved in offsets.\n\nAdditionally, since ``nn.EmbeddingBag`` accumulates the average across\nthe embeddings on the fly, ``nn.EmbeddingBag`` can enhance the\nperformance and memory efficiency to process a sequence of tensors.\n\n<img src=\"file://../_static/img/text_sentiment_ngrams_model.png\">\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torch import nn\n\nclass TextClassificationModel(nn.Module):\n\n    def __init__(self, vocab_size, embed_dim, num_class):\n        super(TextClassificationModel, self).__init__()\n        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)\n        self.fc = nn.Linear(embed_dim, num_class)\n        self.init_weights()\n\n    def init_weights(self):\n        initrange = 0.5\n        self.embedding.weight.data.uniform_(-initrange, initrange)\n        self.fc.weight.data.uniform_(-initrange, initrange)\n        self.fc.bias.data.zero_()\n\n    def forward(self, text, offsets):\n        embedded = self.embedding(text, offsets)\n        return self.fc(embedded)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initiate an instance\n\nThe ``AG_NEWS`` dataset has four labels and therefore the number of classes is four.\n\n::\n\n   1 : World\n   2 : Sports\n   3 : Business\n   4 : Sci/Tec\n\nWe build a model with the embedding dimension of 64. The vocab size is equal to the length of the vocabulary instance. The number of classes is equal to the number of labels,\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "train_iter = AG_NEWS(split='train')\nnum_class = len(set([label for (label, text) in train_iter]))\nvocab_size = len(vocab)\nemsize = 64\nmodel = TextClassificationModel(vocab_size, emsize, num_class).to(device)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Define functions to train the model and evaluate results.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import time\n\ndef train(dataloader):\n    model.train()\n    total_acc, total_count = 0, 0\n    log_interval = 500\n    start_time = time.time()\n\n    for idx, (label, text, offsets) in enumerate(dataloader):\n        optimizer.zero_grad()\n        predicted_label = model(text, offsets)\n        loss = criterion(predicted_label, label)\n        loss.backward()\n        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n        optimizer.step()\n        total_acc += (predicted_label.argmax(1) == label).sum().item()\n        total_count += label.size(0)\n        if idx % log_interval == 0 and idx > 0:\n            elapsed = time.time() - start_time\n            print('| epoch {:3d} | {:5d}/{:5d} batches '\n                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),\n                                              total_acc/total_count))\n            total_acc, total_count = 0, 0\n            start_time = time.time()\n\ndef evaluate(dataloader):\n    model.eval()\n    total_acc, total_count = 0, 0\n\n    with torch.no_grad():\n        for idx, (label, text, offsets) in enumerate(dataloader):\n            predicted_label = model(text, offsets)\n            loss = criterion(predicted_label, label)\n            total_acc += (predicted_label.argmax(1) == label).sum().item()\n            total_count += label.size(0)\n    return total_acc/total_count"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Split the dataset and run the model\n\nSince the original ``AG_NEWS`` has no valid dataset, we split the training\ndataset into train/valid sets with a split ratio of 0.95 (train) and\n0.05 (valid). Here we use\n[torch.utils.data.dataset.random_split](https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split)_\nfunction in PyTorch core library.\n\n[CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)_\ncriterion combines ``nn.LogSoftmax()`` and ``nn.NLLLoss()`` in a single class.\nIt is useful when training a classification problem with C classes.\n[SGD](https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html)_\nimplements stochastic gradient descent method as the optimizer. The initial\nlearning rate is set to 5.0.\n[StepLR](https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR)_\nis used here to adjust the learning rate through epochs.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torch.utils.data.dataset import random_split\nfrom torchtext.data.functional import to_map_style_dataset\n# Hyperparameters\nEPOCHS = 10 # epoch\nLR = 5  # learning rate\nBATCH_SIZE = 64 # batch size for training\n\ncriterion = torch.nn.CrossEntropyLoss()\noptimizer = torch.optim.SGD(model.parameters(), lr=LR)\nscheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)\ntotal_accu = None\ntrain_iter, test_iter = AG_NEWS()\ntrain_dataset = to_map_style_dataset(train_iter)\ntest_dataset = to_map_style_dataset(test_iter)\nnum_train = int(len(train_dataset) * 0.95)\nsplit_train_, split_valid_ = \\\n    random_split(train_dataset, [num_train, len(train_dataset) - num_train])\n\ntrain_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,\n                              shuffle=True, collate_fn=collate_batch)\nvalid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,\n                              shuffle=True, collate_fn=collate_batch)\ntest_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,\n                             shuffle=True, collate_fn=collate_batch)\n\nfor epoch in range(1, EPOCHS + 1):\n    epoch_start_time = time.time()\n    train(train_dataloader)\n    accu_val = evaluate(valid_dataloader)\n    if total_accu is not None and total_accu > accu_val:\n      scheduler.step()\n    else:\n       total_accu = accu_val\n    print('-' * 59)\n    print('| end of epoch {:3d} | time: {:5.2f}s | '\n          'valid accuracy {:8.3f} '.format(epoch,\n                                           time.time() - epoch_start_time,\n                                           accu_val))\n    print('-' * 59)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Evaluate the model with test dataset\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Checking the results of the test dataset\u2026\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print('Checking the results of test dataset.')\naccu_test = evaluate(test_dataloader)\nprint('test accuracy {:8.3f}'.format(accu_test))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Test on a random news\n\nUse the best model so far and test a golf news.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ag_news_label = {1: \"World\",\n                 2: \"Sports\",\n                 3: \"Business\",\n                 4: \"Sci/Tec\"}\n\ndef predict(text, text_pipeline):\n    with torch.no_grad():\n        text = torch.tensor(text_pipeline(text))\n        output = model(text, torch.tensor([0]))\n        return output.argmax(1).item() + 1\n\nex_text_str = \"MEMPHIS, Tenn. \u2013 Four days ago, Jon Rahm was \\\n    enduring the season\u2019s worst weather conditions on Sunday at The \\\n    Open on his way to a closing 75 at Royal Portrush, which \\\n    considering the wind and the rain was a respectable showing. \\\n    Thursday\u2019s first round at the WGC-FedEx St. Jude Invitational \\\n    was another story. With temperatures in the mid-80s and hardly any \\\n    wind, the Spaniard was 13 strokes better in a flawless round. \\\n    Thanks to his best putting performance on the PGA Tour, Rahm \\\n    finished with an 8-under 62 for a three-stroke lead, which \\\n    was even more impressive considering he\u2019d never played the \\\n    front nine at TPC Southwind.\"\n\nmodel = model.to(\"cpu\")\n\nprint(\"This is a %s news\" %ag_news_label[predict(ex_text_str, text_pipeline)])"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}